This Jupyter Notebook aims to fulfil the following:
1) Import human connectome matrices of varying age groups
2) Perform community detection using a weighted stochastic block model (WSBM) formulation
3) Measure the differences in community structure using variation of information (VI).

Importing

Libraries

Connectivity Matrices

Here we import the connectivity matrices from the '811_Functional_Connectomes' folder (originally called 1000_Functional_Connectomes, but the folder only contains 811 files). In particular, we seek the connectivity matrices that correspond to certain age groups. Since there are 811 matrices, selecting these by hand would be impractical, so we write some code that obtains the appropriate matrices.

Young Participants

We classify young participants as being 20-25 years of age inclusive.

Here we obtain the matrices of the young individuals

Now we average across all such matrices

Save this result:

Load this result:
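The selection, averaging, save, and load steps above can be sketched as follows. This is a minimal sketch on toy data: the real notebook reads ages from the study metadata, and `average_group_matrices`, the filename `young_avg.npy`, and the toy matrices are all illustrative.

```python
import numpy as np

def average_group_matrices(matrices, ages, age_min=20, age_max=25):
    """Average the connectivity matrices whose participant age
    falls in [age_min, age_max] (inclusive)."""
    selected = [m for m, a in zip(matrices, ages) if age_min <= a <= age_max]
    return np.mean(selected, axis=0)

# Toy data standing in for the real 177x177 connectomes
mats = [np.full((3, 3), 0.2), np.full((3, 3), 0.4), np.full((3, 3), 0.9)]
ages = [21, 24, 70]                        # only the first two are "young"
young_avg = average_group_matrices(mats, ages)

# Save the result, then load it back (np.save appends the .npy suffix)
np.save("young_avg.npy", young_avg)
reloaded = np.load("young_avg.npy")
```

The same function covers the middle-aged and old groups by changing `age_min` and `age_max`.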

Mid-Aged Participants

We classify middle-aged participants as being 45-55 years of age inclusive.

Here we obtain the matrices of the middle-aged individuals

Now we average across all such matrices

Save this result:

Load this result:

Old Participants

We classify old participants as being 70 years of age or higher.

Here we obtain the matrices of the old individuals

Now we average across all such matrices

Save this result:

Load this result:

Metadata

We obtain the region names and other subnetwork information

Visualisation

Heat Map

We now visualise the matrices using a heatmap via the Seaborn package.
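A sketch of this step; the variable `young_avg` is a placeholder for one of the averaged matrices, with a random symmetric matrix standing in for the real data:

```python
import matplotlib
matplotlib.use("Agg")          # render off-screen so this also runs headless
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

rng = np.random.default_rng(0)
young_avg = rng.uniform(-1, 1, size=(10, 10))   # stand-in correlation matrix
young_avg = (young_avg + young_avg.T) / 2       # make it symmetric

# Correlations lie in [-1, 1], so fix the colour scale accordingly
ax = sns.heatmap(young_avg, cmap="coolwarm", vmin=-1, vmax=1, square=True)
ax.set_title("Young group: averaged functional connectivity")
plt.savefig("young_heatmap.png")
```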

Young Participants

Mid-Aged Participants

Old Participants

Immediate conclusions:

Edge Weight Distribution

This serves as another visualisation tool for each age group.
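A minimal sketch of extracting and plotting the edge weight distribution; `edge_weights` and the toy matrix are illustrative:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

def edge_weights(adj):
    """Extract the upper-triangular (off-diagonal) entries of a
    symmetric adjacency matrix, i.e. one copy of each edge weight."""
    iu = np.triu_indices_from(adj, k=1)
    return adj[iu]

rng = np.random.default_rng(1)
A = rng.uniform(-1, 1, size=(8, 8))
A = (A + A.T) / 2
w = edge_weights(A)                     # 8*7/2 = 28 edge weights

plt.hist(w, bins=10, density=True)      # density=True normalises the histogram
plt.xlabel("edge weight")
plt.ylabel("density")
```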

Graphs

Based on the preceding work, we found that one network measure that separates the age brackets well is connection strength, since it showed clearly that the old age bracket had weaker connections on average.

Converting the Matrices into Graphs

We will not visualise the graphs induced by the adjacency matrices, because they are simply too large to visualise in this manner.

We create a function that converts the brain metadata into a dictionary, so this can be used in the network visualisation.
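A minimal sketch of such a conversion, assuming the metadata reduces to a list of region names; `metadata_to_dict` and the region labels are hypothetical:

```python
import numpy as np
import networkx as nx

def metadata_to_dict(names):
    """Map node index -> region name, for use as NetworkX node labels.
    (The real notebook reads names from the metadata file; a
    hypothetical list stands in for them here.)"""
    return {i: name for i, name in enumerate(names)}

A = np.array([[0.0, 0.8, 0.1],
              [0.8, 0.0, 0.5],
              [0.1, 0.5, 0.0]])
G = nx.from_numpy_array(A)              # weighted undirected graph
labels = metadata_to_dict(["V1", "M1", "PFC"])
G = nx.relabel_nodes(G, labels)
```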

Young Graph

Mid-Aged Graph

Old Graph

No inference can be made from these displayed graphs; they serve as a general visualisation of the connected regions of the brain only.

Network Measures

We use some well-known network measures to establish additional differences between the connectivity matrices. These are:


Some network measures have prerequisites. We assume that our network is not fragmented, since fragmentation would prohibit measures such as closeness centrality and the average shortest path length.

Density $\rho$

The density of a graph is given by the following: $$ \rho_G = \frac{M}{{N \choose 2}}$$ where $M$ is the number of edges present, $N$ is the total number of nodes, and ${N \choose 2} = \frac{N(N-1)}{2}$ is the number of possible edges.

Of course, if the graph is complete, albeit with weak connections, then this density is automatically 1. We show this below. For this reason, we make the graphs sparser by removing weak connections at various thresholds, and then assess the density from there.
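The thresholding idea can be sketched as follows on a toy 3-node matrix; the function name and threshold values are illustrative:

```python
import numpy as np
import networkx as nx

def thresholded_density(adj, threshold):
    """Zero out edges whose |weight| is below `threshold`, then
    compute the density of the resulting graph."""
    pruned = np.where(np.abs(adj) >= threshold, adj, 0.0)
    np.fill_diagonal(pruned, 0.0)
    G = nx.from_numpy_array(pruned)     # zero entries produce no edge
    return nx.density(G)

A = np.array([[0.0, 0.9, 0.1],
              [0.9, 0.0, 0.2],
              [0.1, 0.2, 0.0]])
# with no threshold every pair is connected -> density 1
# with threshold 0.5 only the 0.9 edge survives -> density 1/3
```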

Conclusions:

Mean Strength $\langle k \rangle$

The mean strength of a graph is given by: $$ \langle k \rangle_G = \frac{1}{N}\sum_{i=1}^{N} \left(\frac{1}{N-1}\sum_{j\neq i}w_{ij}\right)$$ where $w_{ij}$ is the weight of the undirected edge between $i$ and $j$, and the division by $N-1$ is because every node $i$ has $N-1$ other nodes $j$ in the graph to connect to.
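The formula above translates directly to NumPy; this is a sketch on a toy matrix (`mean_strength` is an illustrative name):

```python
import numpy as np

def mean_strength(adj):
    """Mean over nodes i of the average weight from i to the other
    N-1 nodes, matching the formula above."""
    N = adj.shape[0]
    off_diag_sums = adj.sum(axis=1) - np.diag(adj)
    return np.mean(off_diag_sums / (N - 1))

A = np.array([[0.0, 0.9, 0.1],
              [0.9, 0.0, 0.2],
              [0.1, 0.2, 0.0]])
# per-node averages: 0.5, 0.55, 0.15 -> mean strength 0.4
```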

I anticipate this measure will not be very informative; despite the different generations of brains having different connections, their overall average connection strength will probably be similar (perhaps slightly weaker for older individuals on average).

Conclusions:

Centralities

Clustering Coefficient Distribution

Clustering coefficient $C_G$, defined nodewise as: $$ C_i := \frac{\text{Number of triangles including the $i^{\text{th}}$ node}}{k_i(k_i-1)/2}$$ where $k_i$ is the degree of node $i$ (its number of neighbours), so that $k_i(k_i-1)/2$ is the number of possible triangles through $i$.

This measures the tendency of any two neighbours of a node to be directly connected, thus forming a triangle shape.

We create a normalised histogram to represent the distributions of the clustering coefficients.
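A sketch of this histogram on a random symmetric matrix. Note that NetworkX's weighted clustering uses the geometric-mean (Onnela et al.) variant rather than the raw triangle count above, but its values also lie in $[0, 1]$:

```python
import numpy as np
import networkx as nx
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
A = rng.uniform(0.0, 1.0, size=(20, 20))   # stand-in positive weights
A = (A + A.T) / 2
np.fill_diagonal(A, 0.0)
G = nx.from_numpy_array(A)

# One clustering coefficient per node, using the edge weights
cc = list(nx.clustering(G, weight="weight").values())
plt.hist(cc, bins=15, density=True)        # density=True normalises
plt.xlabel("clustering coefficient")
plt.ylabel("density")
```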

Conclusion: The middle-aged density is more skewed to higher values than the other age groups, suggesting a stronger connection between groups (triangles) of nodes.

Betweenness Centrality Distribution

This is best defined in words, since it is already implemented in NetworkX: betweenness$_i$ = the fraction of shortest paths passing through node $i$ out of all shortest paths in the graph.

We create a normalised histogram to represent the distributions of the betweenness centralities.
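A sketch of the computation on a toy path-shaped graph, treating the inverse of each correlation as the distance so that stronger edges are shorter (the same convention used for closeness):

```python
import numpy as np
import networkx as nx

A = np.array([[0.0, 0.8, 0.0, 0.0],
              [0.8, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.9],
              [0.0, 0.0, 0.9, 0.0]])   # a path graph 0-1-2-3
G = nx.from_numpy_array(A)
for u, v, d in G.edges(data=True):
    d["distance"] = 1.0 / d["weight"]  # higher correlation -> shorter edge

# Fraction of shortest paths passing through each node
bc = nx.betweenness_centrality(G, weight="distance", normalized=True)
# interior nodes 1 and 2 lie on 2 of the 3 other-pair paths -> 2/3 each
```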

Closeness Centrality Distributions

This is defined as: $$ \text{closeness}_i := \left( \frac{1}{N-1}\sum_{j\neq i} d(v_i, v_j) \right)^{-1} $$ where $d$ is the geodesic distance. Hence, it is the inverse of the average geodesic distance from node $i$.

Now, to make this measure non-trivial in a weighted context, we change the notion of distance based on the edge weights: we set the distance to be the inverse of the correlation value, since more highly correlated nodes are regarded as closer.

We create a normalised histogram to represent the distributions of the closeness centralities.
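The distance conversion and the centrality call can be sketched as follows on a toy 3-node graph:

```python
import numpy as np
import networkx as nx

A = np.array([[0.0, 0.8, 0.4],
              [0.8, 0.0, 0.2],
              [0.4, 0.2, 0.0]])
G = nx.from_numpy_array(A)

# Convert each correlation weight into a distance: more highly
# correlated regions are treated as closer
for u, v, d in G.edges(data=True):
    d["distance"] = 1.0 / d["weight"]

closeness = nx.closeness_centrality(G, distance="distance")
# node 1's distances are 1.25 (to 0) and 3.75 (to 2, via 0), so its
# closeness is 2 / (1.25 + 3.75) = 0.4
```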

Conclusion: As with the clustering coefficient, the middle-aged density is skewed towards higher values than the other age groups, suggesting that nodes in the middle-aged networks are, on average, closer (more strongly connected) to the rest of the network.

Path Lengths

Shortest Path Lengths

We calculate the shortest path length between any two nodes in the network; in a weighted graph this is the path minimising the sum of the edge distances between the two given nodes.

We can average this measure across all possible pairs of nodes, giving the average shortest path length, which has the intuition of overall brain signal transportation efficiency.
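A sketch of this average on the same toy graph, again using inverse correlations as distances:

```python
import numpy as np
import networkx as nx

A = np.array([[0.0, 0.8, 0.4],
              [0.8, 0.0, 0.2],
              [0.4, 0.2, 0.0]])
G = nx.from_numpy_array(A)
for u, v, d in G.edges(data=True):
    d["distance"] = 1.0 / d["weight"]  # inverse-correlation distance

# Mean shortest path over all node pairs (weighted Dijkstra)
avg_sp = nx.average_shortest_path_length(G, weight="distance")
# pairwise distances are 1.25, 2.5, 3.75 -> average 2.5
```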

Conclusion: The middle-aged connectivity networks have the lowest average shortest path, indicating that the brain signal efficiency peaks in this age bracket.

Diameter $D$

The diameter $D$ is defined by: $$ D := \max_{u,v \in V} d(u,v)$$ where $d$ is the shortest distance between nodes $u$ and $v$.

Note: The default NetworkX implementation of diameter does not include any option for edge weights in the distance calculation, so we do this manually using the shortest distance dictionaries found above.
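A sketch of this manual computation, taking the maximum over an all-pairs weighted shortest-distance dictionary:

```python
import numpy as np
import networkx as nx

A = np.array([[0.0, 0.8, 0.4],
              [0.8, 0.0, 0.2],
              [0.1, 0.2, 0.0]])
A = np.maximum(A, A.T)                 # ensure symmetry
G = nx.from_numpy_array(A)
for u, v, d in G.edges(data=True):
    d["distance"] = 1.0 / d["weight"]

# All-pairs weighted shortest distances, then take the maximum
lengths = dict(nx.all_pairs_dijkstra_path_length(G, weight="distance"))
diameter = max(dist for src in lengths.values() for dist in src.values())
# the longest shortest path is d(1, 2) = 1.25 + 2.5 = 3.75
```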

Conclusion: Interestingly, this does not match the conclusion from the average shortest path. We find that the diameter decreases with age, indicating that although the average signal efficiency may be lower in the old age bracket, the most inefficient connections are best optimised there.

WSBMs

Now that we have some comparative properties of the connectivity matrices of different ages, we can proceed to the main section of this project: detecting community structure in the brain using an extended stochastic block model framework for weighted networks.

Aggregating the Matrix Data

Our Bayesian WSBM is founded on the assumption that the edge weights are $\geq 0$, as this allows for a simpler likelihood function later on. Thus, an essential step before we implement the Bayesian WSBM is to take the absolute value of all entries of the adjacency matrices.


The next step is to aggregate the matrices from 177x177 to 44x44, as this avoids NaN problems with the likelihood later (due to too many inter/intra partition connections). We do this by:
1) Removing the final row/column so the matrix is now 176x176, since 176 mod 4 = 0.
2) Performing maximum aggregation on 4x4 blocks, which produces a 44x44 matrix.


These two steps are done with the following function:
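A sketch of such a trim-then-max-pool function using NumPy reshaping (the actual notebook function may differ in detail):

```python
import numpy as np

def aggregate_matrix(adj, block=4):
    """Trim the matrix so its size is divisible by `block`, then
    max-pool over block x block tiles (177x177 -> 176x176 -> 44x44)."""
    n = (adj.shape[0] // block) * block
    trimmed = adj[:n, :n]
    # reshape into (44, 4, 44, 4) tiles and take the max of each tile
    return trimmed.reshape(n // block, block, n // block, block).max(axis=(1, 3))

# Toy 177x177 matrix to demonstrate the shape change
A = np.arange(177 * 177, dtype=float).reshape(177, 177)
agg = aggregate_matrix(np.abs(A))
```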

We now apply this function to our observed matrices:

Young Matrix

Middle-Aged Matrix

Old Matrix

Weighted Matrix Likelihood Function

This function will calculate the joint likelihood of the weights, given a partition vector $\vec{b}$ and parameters $\alpha, \beta > 0$.

If $X_{r,s} = \{a_{ij} \: : \: (b_i,b_j)=(r,s)\}$ is the set of weights between nodes $i$ in partition $r$ and nodes $j$ in partition $s$, then we have likelihood: $$ f(X_{r,s}|\vec{b},\alpha,\beta) = \frac{\Gamma(m_{rs}(X_{r,s})+\alpha)\beta^{\alpha}}{\Gamma(\alpha)(\mu_{rs}(X_{r,s})+\beta)^{m_{rs}(X_{r,s})+\alpha}}$$ where: $$ \mu_{rs}(X_{r,s}) = \sum_{i : b_i=r, \: j : b_j=s} \frac{x_{ij}}{1+\delta_{rs}} $$ is the sum of the weights of connections between partitions $r$ and $s$ (halved when $r=s$ to avoid double counting, by the handshaking lemma), and: $$ m_{rs}(X_{r,s}) = \frac{\#\{(i,j) \: : \: b_i=r, \: b_j=s\}}{1+\delta_{rs}}$$ counts the number of node pairs between partitions $r$ and $s$ (again halved when $r=s$, for the same reason).

To keep the weights generally in the range $(0,1)$, we fix $\alpha=5$ and $\beta=1$, although these could also be found via an MLE approach.

We find typical values of $m_{rs}$ are 20-30, which is small enough for the Gamma function evaluations to remain tractable.

By our working in the essay, as each partition pair $(r,s)$ is independent, the overall likelihood function is a product over all such $1 \leq r \leq s \leq B$, where $B$ is the number of partitions that we will find later using a Bayes factor.
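The likelihood above can be sketched in log form (summing log terms avoids overflow in the Gamma factors); `log_pair_likelihood`, `log_likelihood`, and the toy matrix are illustrative names:

```python
import numpy as np
from math import lgamma, log

def log_pair_likelihood(mu, m, alpha=5.0, beta=1.0):
    """log f(X_{rs} | b, alpha, beta) for one partition pair, where
    mu is the (half-corrected) weight sum and m the pair count."""
    return (lgamma(m + alpha) + alpha * log(beta)
            - lgamma(alpha) - (m + alpha) * log(mu + beta))

def log_likelihood(adj, b, B, alpha=5.0, beta=1.0):
    """Sum of log-likelihoods over all partition pairs r <= s."""
    total = 0.0
    for r in range(B):
        for s in range(r, B):
            rows, cols = np.where(b == r)[0], np.where(b == s)[0]
            sub = adj[np.ix_(rows, cols)]
            denom = 1 + (r == s)           # halve the r == s terms
            total += log_pair_likelihood(sub.sum() / denom,
                                         sub.size / denom, alpha, beta)
    return total

A = np.array([[0.0, 0.9, 0.1, 0.1],
              [0.9, 0.0, 0.1, 0.1],
              [0.1, 0.1, 0.0, 0.8],
              [0.1, 0.1, 0.8, 0.0]])
b = np.array([0, 0, 1, 1])                 # a two-block partition
ll = log_likelihood(A, b, B=2)
```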

Calculating the Optimal Partition Vector

As explained in the essay, the posterior of $\vec{b} \: | \: A$ is intractable due to the normalising constant, so we use an MCMC method to find an optimal value $\vec{b}^*$. In particular, we use Metropolis-Hastings.
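A minimal Metropolis-Hastings sketch under a uniform prior over partitions and a symmetric single-node-move proposal, so the acceptance ratio reduces to the likelihood ratio; the log-likelihood mirrors the formula above, and all names are illustrative:

```python
import numpy as np
from math import lgamma, log

def log_likelihood(adj, b, B, alpha=5.0, beta=1.0):
    """WSBM log-likelihood of the weights under partition vector b."""
    total = 0.0
    for r in range(B):
        for s in range(r, B):
            sub = adj[np.ix_(np.where(b == r)[0], np.where(b == s)[0])]
            denom = 1 + (r == s)
            mu, m = sub.sum() / denom, sub.size / denom
            total += (lgamma(m + alpha) + alpha * log(beta)
                      - lgamma(alpha) - (m + alpha) * log(mu + beta))
    return total

def metropolis_hastings(adj, B, n_steps=2000, seed=0):
    """Propose moving one random node to a random group; accept with
    probability min(1, likelihood ratio). Track the best partition."""
    rng = np.random.default_rng(seed)
    N = adj.shape[0]
    b = rng.integers(0, B, size=N)            # random initial partition
    current_ll = log_likelihood(adj, b, B)
    best_b, best_ll = b.copy(), current_ll
    for _ in range(n_steps):
        proposal = b.copy()
        proposal[rng.integers(N)] = rng.integers(B)   # move one node
        prop_ll = log_likelihood(adj, proposal, B)
        if log(rng.random()) < prop_ll - current_ll:  # MH acceptance
            b, current_ll = proposal, prop_ll
            if current_ll > best_ll:
                best_b, best_ll = b.copy(), current_ll
    return best_b, best_ll

A = np.array([[0.0, 0.9, 0.1, 0.1],
              [0.9, 0.0, 0.1, 0.1],
              [0.1, 0.1, 0.0, 0.8],
              [0.1, 0.1, 0.8, 0.0]])
b_star, ll_star = metropolis_hastings(A, B=2, n_steps=300)
```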